NetApp cluster recovery after power outage

Starting point

For an experimental platform I was handed a pre-configured NetApp, with no support contract behind it. Until then I had never worked with NetApp or any other SAN. The NetApp primarily serves an HPE DL380 G10 that boots ESXi from a LUN on it. Suddenly nothing worked: the NetApp was off. First thought: power outage? So, boot it. I connected to both nodes' serial consoles via Micro-USB and opened PuTTY on COM20 and COM21 at 115200 baud.
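
For reference, the same consoles can also be reached from Linux instead of PuTTY; a minimal sketch, assuming the two controllers show up as /dev/ttyUSB0 and /dev/ttyUSB1 (adjust to whatever your system enumerates):

screen /dev/ttyUSB0 115200
screen /dev/ttyUSB1 115200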

Recovering the cluster

COM20

LOADER-A> boot_ontap

[...blabla...]
May 23 14:03:39 [netapp05-02:mgr.boot.unequalDist:error]: Warning: Unequal number of disks will be used for auto-partitioning of the root aggregate on the local system and HA partner. The local system will use 8 disks but the HA partner will use 6 disks. To correct this situation, boot both controllers into maintenance mode and remove the ownership of all disks.
May 23 14:03:39 [netapp05-02:fmmb.disk.notAccsble:notice]: All Local mailbox disks are inaccessible.
May 23 14:03:39 [netapp05-02:fmmb.disk.notAccsble:notice]: All Partner mailbox disks are inaccessible.
May 23 14:03:39 [netapp05-02:raid.assim.disk.brokenPreAssim:error]: Broken Disk 0b.05.9P2 Shelf 5 Bay 9 [NETAPP   X427_HCBFE1T8A10 NA06] S/N [08HJ1LJANP002] UID [6000CCA0:2C558DC0:500A0981:00000002:00000000:00000000:00000000:00000000:00000000:00000000] detected prior to assimilation.
May 23 14:03:39 [netapp05-02:kern.syslog.msg:notice]: FAILOVER: fmrsrc_startSecondary() - TakeOver for fmdisk_reserve done in 20 msecs (Since TO started: 20)

May 23 14:03:39 [netapp05-01:raid.assim.disk.brokenPreAssim:error]: Broken Disk 0a.05.6P1 Shelf 5 Bay 6 [NETAPP   X427_HCBFE1T8A10 NA06] S/N [08HJ5RVANP001] UID [6000CCA0:2C55CBE8:500A0981:00000001:00000000:00000000:00000000:00000000:00000000:00000000] detected prior to assimilation.
May 23 14:03:39 [netapp05-01:raid.assim.disk.brokenPreAssim:error]: Broken Disk 0b.05.9P1 Shelf 5 Bay 9 [NETAPP   X427_HCBFE1T8A10 NA06] S/N [08HJ1LJANP001] UID [6000CCA0:2C558DC0:500A0981:00000001:00000000:00000000:00000000:00000000:00000000:00000000] detected prior to assimilation.
May 23 14:03:39 [netapp05-01:raid.assim.disk.brokenPreAssim:error]: Broken Disk 0a.05.6P2 Shelf 5 Bay 6 [NETAPP   X427_HCBFE1T8A10 NA06] S/N [08HJ5RVANP002] UID [6000CCA0:2C55CBE8:500A0981:00000002:00000000:00000000:00000000:00000000:00000000:00000000] detected prior to assimilation.
May 23 14:03:40 [netapp05-02:kern.syslog.msg:notice]: FAILOVER: fmrsrc_startSecondary() - TakeOver for raid done in 274 msecs (Since TO started: 294)
[...]

May 23 14:03:42 [netapp05-02:LUN.nvfail.vol.proc.complete:error]: LUNs in volume IIL_4 (DSID 1314) have been brought offline because an inconsistency was detected in the nvlog during boot or takeover.
May 23 14:03:42 [netapp05-02:kern.syslog.msg:notice]: The system was down for 73786 seconds
May 23 14:03:42 [netapp05-02:cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of netapp05-02 by netapp05-01 disabled (Already in takeover mode).
May 23 14:03:42 [netapp05-02:cf.fm.takeoverStarted:notice]: Failover monitor: takeover started
[...]

May 23 14:03:42 [netapp05-01:lmgr.sf.up.ready:notice]: Lock manager allowed high availability module to transition to the up state for the following reason: Partner down.
[...]

May 23 14:04:00 [netapp05-02:monitor.globalStatus.critical:EMERGENCY]: This node has taken over netapp05-01. Disk on adapter 0b, shelf 5, bay 9, failed.
May 23 14:04:18 [netapp05-02:callhome.root.vol.recovery.reqd:EMERGENCY]: Call home for ROOT VOLUME NOT WORKING PROPERLY: RECOVERY REQUIRED.

COM21

LOADER-B> boot_ontap

[...blabla...]
May 23 14:11:46 [netapp05-01:disk.init.failureBytes:error]: Failed disk 0b.05.12 detected during disk initialization.
Reservation conflict found on this node's disks!
[...]

Waiting for giveback...(Press Ctrl-C to abort wait)
This node was previously declared dead.
Pausing to check HA partner status ...
partner is operational and in takeover mode.

You must initiate a giveback or shutdown on the HA
partner in order to bring this node online.


The HA partner is currently operational and in takeover mode. This node cannot continue unless you initiate a giveback on the partner.
Once this is done this node will reboot automatically.

waiting for giveback...

Uh oh, unhealthy.
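
The normal way out of this state is a giveback from the node that is holding the takeover. A sketch of what that would look like on a healthy system (not an option here, since the takeover node itself was clearly unwell):

netapp05::> storage failover show
netapp05::> storage failover giveback -ofnode netapp05-01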

Fixing

So, log in to Node A:

login: 
Password:
******************************************************
* This is a serial console session. Output from this *
* session is mirrored on the SP console session.     *
******************************************************
***********************
**  SYSTEM MESSAGES  **
***********************

Internal error. Cannot open corrupt replicated database. Automatic recovery
attempt has failed or is disabled. Check the event logs for details. This node
is not fully operational. Contact support personnel for the root volume recovery
procedures.

In the meantime Node B finished booting:

Partner has released takeover lock.
Continuing boot...
[...]
May 23 14:21:51 [netapp05-01:disk.dynamicqual.fail.parse:error]: Device qualification information file (/etc/qual_devices) is invalid. The following error, " Unsupported File version detected.
" has been detected. For further information about correcting the problem, search the knowledgebase of the NetApp technical support support web site for the "[disk.dynamicqual.fail.parse]" keyword.
[...]
May 23 14:21:51 [netapp05-01:cf.fsm.takeoverOfPartnerDisabled:error]: Failover monitor: takeover of netapp05-02 disabled (unsynchronized log).
May 23 14:21:52 [netapp05-01:raid.fdr.reminder:error]: Failed Disk 0a.05.6 Shelf 5 Bay 6 [NETAPP   X427_HCBFE1T8A10 NA06] S/N [08HJ5RVA] UID [5000CCA0:2C55CBE8:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] is still present in the system and should be removed.
May 23 14:21:52 [netapp05-01:raid.fdr.reminder:error]: Failed Disk 0b.05.9 Shelf 5 Bay 9 [NETAPP   X427_HCBFE1T8A10 NA06] S/N [08HJ1LJA] UID [5000CCA0:2C558DC0:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] is still present in the system and should be removed.

And yes, disks are obviously broken!

Node B now seems to be in noticeably better shape than Node A. On Node A:

netapp05::> cluster show
Error: "show" is not a recognized command

Node B:

netapp05::> cluster show
Node                  Health  Eligibility
--------------------- ------- ------------
netapp05-01           true    true
netapp05-02           false   true
2 entries were displayed.

Let's try a recovery:

netapp05::*> system node show
Node      Health Eligibility Uptime        Model       Owner    Location
--------- ------ ----------- ------------- ----------- -------- ---------------
""        -      -                       - -           -        -
netapp05-02
          -      -                00:37:41 FAS2650

Warning: Cluster HA has not been configured.  Cluster HA must be configured on
         a two-node cluster to ensure data access availability in the event of
         storage failover. Use the "cluster ha modify -configured true" command
         to configure cluster HA.
2 entries were displayed.

netapp05::*> system configuration backup show
Node       Backup Name                               Time               Size
---------  ----------------------------------------- ------------------ -----
netapp05-02
           netapp05.8hour.2025-05-12.18_15_03.7z     05/12 19:15:03     76.00MB
netapp05-02
           netapp05.8hour.2025-05-13.02_15_03.7z     05/13 03:15:03     76.65MB
netapp05-02
           netapp05.daily.2025-05-12.00_10_03.7z     05/12 01:10:03     76.90MB
netapp05-02
           netapp05.daily.2025-05-13.00_10_03.7z     05/13 03:15:03     76.25MB
netapp05-02
           netapp05.weekly.2025-05-04.00_15_03.7z    05/04 01:15:03     77.49MB
netapp05-02
           netapp05.weekly.2025-05-11.00_15_03.7z    05/11 01:15:03     77.75MB
6 entries were displayed.

netapp05::*> system configuration recovery node restore -backup  netapp05.8hour.2025-05-13.02_15_03.7z

Warning: This command overwrites local configuration files with files contained
         in the specified backup file. Use this command only to recover from a
         disaster that resulted in the loss of the local configuration files.
         The node will reboot after restoring the local configuration.
Do you want to continue? {y|n}: y
Verifying that the node is offline in the cluster.
Verifying that the backup tarball exists.
Extracting the backup tarball.
Verifying that software and hardware of the node match with the backup.
Stopping cluster applications.  

After the reboot, unfortunately everything was still the same. I tried an older backup.

These new problems now surfaced during boot:

varfs_backup_restore: bootarg.abandon_varfs is set! Skipping /var backup.

The first message is probably a side effect of the configuration restore. But this one?

*********************************************
* ALERT: SHA256 checksum failure detected   *
*        in boot device                     *
*                                           *
* Contact technical support for assistance. *
*********************************************
ERROR: netapp_varfs: SHA256 checksum failure detected in boot device. Contact technical support for assistance.
[...]
May 26 07:56:34 [netapp05-02:callhome.root.vol.recovery.reqd:EMERGENCY]: Call home for ROOT VOLUME NOT WORKING PROPERLY: RECOVERY REQUIRED.

This pointed me to the following KB article: https://kb.netapp.com/on-prem/ontap/OHW/OHW-KBs/System_does_not_start_after_reboot_due_to_Unable_to_recover_the_local_database_of_Data_Replication_Module
Of the three environment variables it lists, though, two were unknown on my system...
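
Which variables actually exist can be checked beforehand with printenv at the LOADER prompt; it simply dumps the whole environment, so anything not set just does not appear:

LOADER-A> printenv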

LOADER-A> unsetenv bootarg.rdb_corrupt
LOADER-A> unsetenv bootarg.init.boot_recovery
LOADER-A> unsetenv bootarg.rdb_corrupt.mgwd
LOADER-A> saveenv
LOADER-A> bye

EUREKA!

netapp05::> cluster show
Node                  Health  Eligibility
--------------------- ------- ------------
netapp05-01           true    true
netapp05-02           true    true
2 entries were displayed.
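
Given the boot warning earlier about cluster HA not being configured, it is worth confirming that HA and storage failover are active again now that both nodes report healthy. These are standard commands (the modify is only needed if the show reports HA as not configured):

netapp05::> cluster ha show
netapp05::> cluster ha modify -configured true
netapp05::> storage failover show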

Reviving the aggregate

Three SAS disks in this aging NetApp are now dying. Is it too late for the aggregate?

netapp05::> storage aggregate show


Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
--------- -------- --------- ----- ------- ------ ---------------- ------------
n01_SAS         0B        0B    0% failed       0 netapp05-01      raid_dp,
                                                                   partial
n01_root   368.4GB   17.85GB   95% online       1 netapp05-01      raid_dp,
                                                                   normal
n02_SSD    18.86TB   15.17TB   20% online      17 netapp05-02      raid_dp,
                                                                   normal
n02_root   368.4GB   17.85GB   95% online       1 netapp05-02      raid_dp,
                                                                   normal
4 entries were displayed.

netapp05::> storage disk show
                     Usable           Disk    Container   Container
Disk                   Size Shelf Bay Type    Type        Name      Owner
---------------- ---------- ----- --- ------- ----------- --------- --------

Info: This cluster has partitioned disks. To get a complete list of spare disk
      capacity use "storage aggregate show-spare-disks".
1.5.0                     -     5   0 unknown unsupported -         -
1.5.1                1.63TB     5   1 SAS     shared      n01_SAS   netapp05-02
1.5.2                1.63TB     5   2 SAS     shared      n01_SAS, n01_root
                                                                    netapp05-01
1.5.3                1.63TB     5   3 SAS     shared      n01_SAS, n02_root
                                                                    netapp05-02
1.5.4                1.63TB     5   4 SAS     shared      n01_SAS, n01_root
                                                                    netapp05-01
1.5.5                1.63TB     5   5 SAS     shared      n01_SAS, n02_root
                                                                    netapp05-02
1.5.6                1.63TB     5   6 SAS     broken      -         netapp05-01
1.5.7                1.63TB     5   7 SAS     shared      n01_SAS, n02_root
                                                                    netapp05-02
1.5.8                1.63TB     5   8 SAS     shared      n01_SAS, n01_root
                                                                    netapp05-01
1.5.9                1.63TB     5   9 SAS     broken      -         netapp05-02
1.5.10               1.63TB     5  10 SAS     shared      n01_SAS, n01_root
                                                                    netapp05-01
1.5.11               1.63TB     5  11 SAS     shared      n01_SAS, n02_root
                                                                    netapp05-02
1.5.12                    -     5  12 SAS     broken      -         -
1.5.13               1.63TB     5  13 SAS     shared      n01_SAS, n02_root
                                                                    netapp05-02
1.5.14               1.63TB     5  14 SAS     shared      n01_SAS, n01_root
                                                                    netapp05-01
1.5.15               3.49TB     5  15 SSD     aggregate   n02_SSD   netapp05-02
1.5.16               3.49TB     5  16 SSD     aggregate   n02_SSD   netapp05-02
1.5.17               3.49TB     5  17 SSD     aggregate   n02_SSD   netapp05-02
1.5.18               3.49TB     5  18 SSD     aggregate   n02_SSD   netapp05-02
1.5.19               3.49TB     5  19 SSD     aggregate   n02_SSD   netapp05-02
1.5.20               3.49TB     5  20 SSD     aggregate   n02_SSD   netapp05-02
1.5.21               3.49TB     5  21 SSD     aggregate   n02_SSD   netapp05-02
1.5.22               3.49TB     5  22 SSD     aggregate   n02_SSD   netapp05-02
1.5.23               3.49TB     5  23 SSD     spare       Pool0     netapp05-02
24 entries were displayed.
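
Two standard commands make this picture easier to read: one lists only the broken disks, the other the remaining spare capacity (the latter is what the info message above points to):

netapp05::> storage disk show -broken
netapp05::> storage aggregate show-spare-disks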

Using the spare disk

A spare disk is available (I am surprised it is not automatically added to the aggregate). First I tried replacing a broken disk with the spare. No luck.

netapp05::> storage disk replace -disk 1.5.6 -replacement 1.5.1 -action start
Error: command failed: Disk "1.5.6" is not in present state.

Ok, try to temporarily reactivate the disks:

netapp05::> set advanced
netapp05::*> disk unfail -disk 1.5.6
netapp05::*> disk unfail -disk 1.5.9
netapp05::*> disk unfail -disk 1.5.12

netapp05::*> aggr show-status

Owner Node: netapp05-01
 Aggregate: n01_SAS (online, raid_dp, reconstruct, degraded) (block checksums)
  Plex: /n01_SAS/plex0 (online, normal, active, pool0)
   RAID Group /n01_SAS/plex0/rg0 (reconstruction 0% completed, block checksums)
                                                              Usable Physical
     Position Disk                        Pool Type     RPM     Size     Size Status
     -------- --------------------------- ---- ----- ------ -------- -------- ----------
     shared   1.5.10                       0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.3                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.5                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.7                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.9                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.11                       0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.1                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.13                       0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.14                       0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.4                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.6                        0   SAS    10000   1.49TB   1.64TB (reconstruction 0% completed)
     shared   1.5.8                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.2                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   FAILED                       -   -          -   1.49TB       0B (failed)

So 1.5.9 is running again (but I do not trust it), 1.5.12 stays dead, and 1.5.6 is busy reconstructing...
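
To keep an eye on the rebuild, re-running show-status on just that aggregate is enough; a sketch:

netapp05::*> storage aggregate show-status -aggregate n01_SAS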

Moving from an old NetApp

To replace the broken disks I used drives from another NetApp. The disks are identical, same part number. But the first of them had not been cleanly removed from its old cluster and could not simply be read:

netapp05::*> storage disk show
                     Usable           Disk    Container   Container
Disk                   Size Shelf Bay Type    Type        Name      Owner
---------------- ---------- ----- --- ------- ----------- --------- --------
1.5.0                     -     5   0 unknown unsupported -         -
1.5.1                1.63TB     5   1 SAS     shared      n01_SAS   netapp05-02
1.5.2                1.63TB     5   2 SAS     shared      n01_SAS, n01_root
[...]

netapp05::*> storage disk show -disk 1.5.0
                  Disk: 1.5.0
        Container Type: unsupported
            Owner/Home: -  / -
               DR Home: -
    Stack ID/Shelf/Bay: 1  / 5  / 0
                   LUN: 0
                 Array: N/A
                Vendor: NETAPP
                 Model: X427_HCBFE1T8A10
         Serial Number: -
                   UID: 5000CCA0:2C55A2F0:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000
                   BPS: 520
         Physical Size: 0B
              Position: present
Checksum Compatibility: block
             Aggregate: -
                  Plex: -
Paths:
                                LUN  Initiator Side        Target Side                                                                              Link
Controller         Initiator     ID  Switch Port           Switch Port           Acc Use  Target Port              TPGN    Speed      I/O KB/s          IOPS
------------------ ---------  -----  --------------------  --------------------  --- ---  -----------------------  ------  -------  ------------  ------------
netapp05-02        0a             0  N/A                   N/A                   AO  INU  5000cca02c55a2f2             86  12 Gb/S             0             0
netapp05-02        0b             0  N/A                   N/A                   AO  RDY  5000cca02c55a2f1             55  12 Gb/S             0             0
netapp05-01        0a             0  N/A                   N/A                   AO  INU  5000cca02c55a2f1             55  12 Gb/S             0             0
netapp05-01        0b             0  N/A                   N/A                   AO  RDY  5000cca02c55a2f2             86  12 Gb/S             0             0

Errors:
The node is configured with All-Flash Optimized personality and this disk is not an SSD. The disk needs to be removed from the system.

After a lot of searching I found out that this points to self-encrypting disks (SED). Fortunately I still had access to the old cluster and could put the disk back in there. The two volumes on those disks were moved with volume move to the remaining aggregates (probably not needed anymore, but you never know), and the old SAS aggregate was then deleted. After plenty of trial and error, the following steps finally worked:

set d
node run netapp-master-01 -command disk remove_ownership 0a.00.0P1
node run netapp-master-01 -command disk remove_ownership 0a.00.0P2
node run netapp-master-01 -command disk remove_ownership 0a.00.0

system node run -node netapp-master-01 disk unpartition 0a.00.0

storage encryption disk modify -disk 1.0.0 -fips-key-id 0x0
storage encryption disk modify -disk 1.0.0 -data-key-id  0x0

netapp01::> storage disk show
                     Usable           Disk    Container   Container
Disk                   Size Shelf Bay Type    Type        Name      Owner
---------------- ---------- ----- --- ------- ----------- --------- --------

Info: This cluster has partitioned disks. To get a complete list of spare disk
      capacity use "storage aggregate show-spare-disks".
1.0.0                1.63TB     0   0 SAS     spare       Pool0     netapp-master-01


netapp01::> storage encryption disk show
Disk     Mode Data Key ID
-------- ---- ----------------------------------------------------------------
1.0.0    data 000000000000000002000000000001000B8C0C4412BBFE9EDB2951E40BE463E6

Now the disk was finally marked as a spare and unlocked, and the new cluster recognized it.

storage disk assign -disk 1.5.2 -owner netapp05-01 -data

netapp05::storage disk*> storage disk show
                     Usable           Disk    Container   Container
Disk                   Size Shelf Bay Type    Type        Name      Owner
---------------- ---------- ----- --- ------- ----------- --------- --------
---------------- ---------- ----- --- ------- ----------- --------- --------
1.5.1                1.63TB     5   1 SAS     shared      n01_SAS   netapp05-02
1.5.2                1.63TB     5   2 SAS     shared      -         netapp05-01

Now this disk only has to replace the FAILED slot in the aggregate. That seems to happen automatically.

netapp05::storage disk*> storage aggregate show-status

Owner Node: netapp05-01
 Aggregate: n01_SAS (online, raid_dp, reconstruct, degraded) (block checksums)
  Plex: /n01_SAS/plex0 (online, normal, active, pool0)
   RAID Group /n01_SAS/plex0/rg0 (reconstruction 0% completed, block checksums)
                                                              Usable Physical
     Position Disk                        Pool Type     RPM     Size     Size Status
     -------- --------------------------- ---- ----- ------ -------- -------- ----------
     shared   1.5.10                       0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.3                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.5                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.7                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.9                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.11                       0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.1                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.13                       0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.14                       0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.4                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.6                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.8                        0   SAS    10000   1.49TB   1.64TB (normal)
     shared   1.5.2                        0   SAS    10000   1.49TB   1.64TB (reconstruction 0% completed)
     shared   FAILED                       -   -          -   1.49TB       0B (failed)

Fixing the LUN

Starting point

After the cluster and the aggregates were back up, the servers connected via iSCSI still refused to start. The SMB share was reachable... interesting.
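
Before touching anything, a couple of quick sanity checks on the block side are worth doing (standard commands; the SVM names are the ones that appear in the lun show output further down):

netapp05::> vserver iscsi show
netapp05::> lun mapping show -vserver svm11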

Bring online

In the NetApp web GUI, IOPS peaks showed up every 5 seconds: each time the server attempted to boot and then reported "no bootable device". Under the LUN actions I found the option "Bring online", which immediately returned an alert: "The volume is in nvfailed state".

After a quick search I found: https://kb.netapp.com/on-prem/ontap/OHW/OHW-KBs/lun_online_fails_with_Error_The_volume_is_in_nvfailed_state

netapp05::> ucadmin show
                       Current  Current    Pending  Pending    Admin
Node          Adapter  Mode     Type       Mode     Type       Status
------------  -------  -------  ---------  -------  ---------  -----------
netapp05-01   0c       cna      target     -        -          online
netapp05-01   0d       cna      target     -        -          online
netapp05-01   0e       cna      target     -        -          online
netapp05-01   0f       cna      target     -        -          online
netapp05-02   0c       cna      target     -        -          online
netapp05-02   0d       cna      target     -        -          online
netapp05-02   0e       cna      target     -        -          online
netapp05-02   0f       cna      target     -        -          online
8 entries were displayed.

netapp05::> network interface show
            Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
Cluster
            netapp05-01_clus1
                         up/up    169.254.214.208/16 netapp05-01   e0a     true
            netapp05-01_clus2
                         up/up    169.254.52.115/16  netapp05-01   e0b     true
            netapp05-02_clus1
                         up/up    169.254.159.191/16 netapp05-02   e0a     true
            netapp05-02_clus2
                         up/up    169.254.244.129/16 netapp05-02   e0b     true
netapp05
            bkup-lif_1   up/up    172.16.19.222/24   netapp05-01   a0a-1619
                                                                           true
            bkup-lif_2   up/up    172.16.19.223/24   netapp05-02   a0a-1619
                                                                           true
            cluster_mgmt up/up    172.16.17.221/24   netapp05-01   e0M     true
[...]
39 entries were displayed.

Everything up/up... But this:

netapp05::> lun show
Vserver   Path                            State   Mapped   Type        Size
--------- ------------------------------- ------- -------- -------- --------
svm10     /vol/IIL_Insight/IIL_Insight    nvfail  mapped   vmware     1.95TB
svm11     /vol/IIL_1/IIL_1                nvfail  mapped   vmware     1.95TB
svm11     /vol/IIL_1_clone_300/IIL_1      nvfail  unmapped vmware     1.95TB
svm11     /vol/IIL_1_clone_371/IIL_1      nvfail  unmapped vmware     1.95TB
svm12     /vol/IIL_2/IIL_2                nvfail  mapped   vmware     1.95TB
svm13     /vol/IIL_3/IIL_3                nvfail  mapped   vmware     1.95TB
svm14     /vol/IIL_4/IIL_4                nvfail  mapped   vmware     1.95TB

netapp05::> lun online -vserver svm11 -path /vol/IIL_1/IIL_1

Error: command failed: The volume is in nvfailed state

Not good...

netapp05::*> volume modify  -vserver svm11 -volume IIL_1 -in-nvfailed-state false
Volume modify successful on volume IIL_1 of Vserver svm11.

netapp05::*> lun online -vserver svm11 -path /vol/IIL_1/IIL_1

That was refreshingly simple...
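
Since every LUN's volume was in the same state, the remaining ones can be found and cleared the same way; a sketch using the same field as the modify command above (advanced privilege required):

netapp05::*> volume show -in-nvfailed-state true -fields in-nvfailed-state
netapp05::*> volume modify -vserver svm12 -volume IIL_2 -in-nvfailed-state false
netapp05::*> lun online -vserver svm12 -path /vol/IIL_2/IIL_2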